Abstract

We consider the problem of collaborative filtering from a channel coding perspective. We model the 
underlying rating matrix as a finite alphabet matrix with block constant structure. The observations are 
obtained from this underlying matrix through a discrete memoryless channel with a noisy part representing 
noisy user behavior and an erasure part representing missing data. Moreover, the clusters over which 
the underlying matrix is constant are unknown. We establish a sharp threshold result for this model: if
the largest cluster size is smaller than $C_1 \log(mn)$ (where the rating matrix is of size $m \times n$), then the
underlying matrix cannot be recovered with any estimator, but if the smallest cluster size is larger than
$C_2 \log(mn)$, then we show a polynomial time estimator with diminishing probability of error. In the
case of uniform cluster size, not only the order of the threshold, but also the constant is identified.

Preliminary results related to this submission were presented by us in [2] (ISIT 2009, Seoul, Korea).

I. Introduction 

As new content mushrooms at a brisk pace, finding relevant information is increasingly a challenge. 
Consequently, recommendation systems are commonly being used to assist users: Amazon recommends
books, Netflix recommends movies, LinkedIn recommends professional contacts, Google recommends
webpages for a given query, etc. Such recommendation systems exploit various aspects to make suggestions:
popularity amongst peers, similarity of content, available user-item ratings, etc. This paper is about
collaborative filtering using the rating matrix: we are interested in making recommendations using only 
available ratings given by users to the items they have experienced. In a practical system, such a rating 
based collaborative filter is typically complemented by content-based analysis specific to the data. 

There is vast literature on recommendation systems and collaborative filtering; see for example the 
special issue [9] and the survey paper [3]. Given the massive datasets and the lack of a good statistical
model of user behavior, the dominant stream of work has been to propose methods and demonstrate their
scalability on real data sets. However, recently the Netflix Prize [1] has popularized the problem to other 
research communities and several researchers have started exploring provably good methods. This paper 
falls in the latter category: we deal with fundamental limits of collaborative filters. In the remainder of 
this section, we first discuss related models and results, and then outline our model and results. 

A. Related Work 

The Netflix data consists of a rating matrix where the rows correspond to movies and the columns
correspond to users. Only a small fraction of the entries are known and the goal is to estimate the 
missing entries, that is, this is a matrix completion problem. Several algorithms have been proposed 
and tested on this data set; see for example [13]. Mathematically, without any further restriction, this is 
an ill-posed problem. Motivated by this, some authors have recently considered the matrix completion 
problem under the restriction of low-rank matrices. (This problem also arises in other contexts such as 
location estimation in sensor networks.) This problem has attracted much attention, and in the past year 
a number of results have been reported. In [5], using nuclear norm minimization proposed in [16], an 
upper bound on the number of samples needed for recovery asymptotically is derived in terms of the
size and rank of the matrix. In [6], a lower bound is established on the number of samples needed by
any algorithm. The order of this lower bound is shown to be achievable in [12]. In [14], the problem of 
matrix recovery from linear measurements (of which sampling is a special case) is considered and a new 
algorithm is proposed. In [4], the problem of matrix completion under bounded noise is considered. A 
semi-definite programming based algorithm is proposed and shown to have recovery error proportional 
to the noise magnitude. 

In this paper, we take an alternative channel coding viewpoint of the problem. Our results differ from 
the above works in several aspects outlined below. 

• We consider finite alphabet for the ratings and a different model for the rating matrix based on row 
and column clusters. 

• We consider noisy user behavior, and our goal is not to complete the missing entries, but to estimate 
an underlying "block constant" matrix (in the limit as the matrix size grows). 

• Since we consider a finite alphabet, even in the presence of noise, error free recovery is asymptotically 
feasible. Hence, unlike [4], which considers real-valued matrices, we do not allow any distortion. 

We next outline our model and results. 

B. Summary of Our Model and Results 

We consider a finite alphabet for the ratings. In this section, we briefly outline our model and results 
without any mathematical details; the details can be found in subsequent sections. 

To motivate our model, consider an ideal situation where every user rates every item without any noise. 
In this ideal scenario, it is reasonable to expect that similar users rate similar items by the same value. 
We therefore assume that the users (items) are clustered into groups of similar users (items, respectively). 
The rating matrix in this ideal situation (say X with size m x n) is then a block constant matrix (where 
the blocks correspond to cartesian product of row and column clusters). The observations are obtained 
from X by passing its entries through a discrete memoryless channel (DMC) consisting of an erasure 
channel modeling missing data and a noisy DMC representing noisy user behavior. Moreover, the row 
and column clusters are unknown. The goal is to make recommendations by estimating X based on the 
observations. The performance metric we use is the probability of block error: we make an error if any of 
the entries in the estimate is erroneous. Our goal is to identify conditions under which error free recovery 
is possible in the limit as the matrix size grows large. Thus we view the recommendation system problem 
as a channel coding problem.

The cluster sizes in our model represent the resolution: the larger the cluster, the fewer the degrees
of freedom (or the lower the rate of the channel code). If the channel is noisier and erasures are more frequent, then we
can only support a small number of codewords. The challenge is to find the exact order. For our model,
we show that if the largest cluster size (defined precisely in Section III) is smaller than $C_1 \log(mn)$,
where $C_1$ is a constant dependent on the channel parameters, then for any estimator the probability of
error approaches one. On the other hand, if the smallest cluster size (defined precisely in Section III) is
larger than $C_2 \log(mn)$, where $C_2$ is a constant dependent on the channel parameters, then we give a
polynomial time algorithm that has diminishing probability of error. Thus we identify the order of the
threshold exactly. In the case of uniform cluster size, the constants $C_1$ and $C_2$ are identical and thus
in this special case, even the constant is identified precisely. Moreover, for the special case of binary
ratings and uniform cluster size, the algorithm used to show the achievability part does not depend on
the cluster size or the erasure parameter, and needs knowledge only of a worst case parameter for the noisy part
of the channel. These results are obtained by averaging over X (as per the probability law specified in
Section II).

The achievability part of our result is shown by first clustering the rows and columns, and then 
estimating the matrix entries assuming that the clustering is correct. The clustering is done by computing 
a normalized Hamming metric for every pair of rows and comparing with a threshold to determine if the 
rows are in same cluster or not. The converse is proved by considering the case when the clusters are 
known exactly. Our results for the average case show that the threshold is determined by the problem of 
estimating entries, and relatively, clustering is an easier task (see Figure 1 for an illustration).

C. Organization of the Paper 

The precise model for X and the observations is stated in Section II. The case of uniform cluster size
and binary ratings leads to sharper bounds and results. Hence results for this case are given in Section III.
The case of general alphabets and non-uniform cluster sizes is considered in Section IV. The conclusion
is given in Section V, while all the proofs are collected together in Section VI.

D. Notation 

All the logarithms are to the natural base unless specified otherwise. $D(\mu\|\nu)$ denotes the KL divergence
([8]) between probability mass functions $\mu$ and $\nu$. By $T = \Omega(f(n))$ we mean that for n large enough,
$T \ge \text{constant} \cdot f(n)$. By $1(A)$ we denote the indicator variable, which is 1 if A is true and 0 otherwise.

Fig. 1. The figure shows lower and upper bounds for the probability of error under known clustering (Theorem 2), the asymptotic
cluster size threshold from Theorem 1, and an upper bound on the clustering error (Theorem 3) for the case $m = n = 10^6$,
erasure probability e = 0.9, and binary symmetric channel with error p = 0.25. The threshold in the clustering algorithm is
chosen to be $d_0 = (2p(1 - p) + 1)/3 = 0.4583$.



II. Model and Assumptions 

The main elements of our model are a block constant ensemble of rating matrices (whose blocks of 
constancy are not known) and an observation matrix obtained from the underlying rating matrix via a 
noisy channel and erasures. The noise in the observations represents the inherent noise in user-item ratings 
as well as the error in our model. The erasures denote missing entries. To be more precise, suppose X is 
the unknown m x n rating matrix with entries from a finite alphabet, where n is the number of buyers 
and m is the number of items. Let $\mathcal{A} = \{A_i\}_{i=1}^{r}$ and $\mathcal{B} = \{B_j\}_{j=1}^{t}$ be partitions of [1 : m] and [1 : n]
respectively. We call the sets $A_i \times B_j$ clusters and we call the $A_i$'s ($B_j$'s) the row (column) clusters. We
denote the corresponding row and column cluster sizes by $m_i$ and $n_j$, and the number of row clusters
and the number of column clusters by r and t respectively. Thus $\sum_{i=1}^{r} m_i = m$ and $\sum_{j=1}^{t} n_j = n$.

We state our results under two sets of conditions: A1)-A4) and B1)-B3) below.
Conditions A1)-A4) are a special case of conditions B1)-B3). The results under A1)-A4) are sharper and
illustrate the important concepts more easily. Hence they are stated separately. We begin by stating and
discussing A1)-A4) first and then we state B1)-B3). (A few additional conditions needed in the results
are stated at appropriate places.)

Conditions A1)-A4): The conditions A1)-A4) below correspond to a binary rating matrix with equal size
clusters and uniform probability of sampling entries.

A1) The entries of X are from {0, 1}.

A2) The row (column) clusters are of equal size: $m_i = m_0$, $n_j = n_0$ for all i, j.

A3) X is constant over the cluster $A_i \times B_j$ and the entries are i.i.d. Bernoulli(1/2) across the clusters.

A4) The observed data Y, with entries in {0, 1, e} (e denotes erasure), is obtained by passing the entries of X through
the cascade of a binary symmetric channel (BSC) with probability of error p and an erasure channel
with erasure probability e.

The cluster sizes are representative of the resolution of X: large cluster sizes correspond to a coarse
structure with fewer degrees of freedom in choosing X, while small cluster sizes correspond to a fine
structure. Condition A2) suggests that we can think of the cluster size $m_0 n_0$ as representative of the
resolution of X and it plays a central role in our results. If we think of all permissible X as a channel
code, then a higher $m_0 n_0$ corresponds to a smaller rate code. However, in order to interpret $m_0 n_0$
precisely, we also need to take into account condition A3). When the entries of the clusters are filled with
i.i.d. Bernoulli(1/2) random variables as per A3), it may happen that rows in two clusters turn out to be the
same, and hence these two row clusters can be merged to form a single bigger cluster. The following
lemma shows that if the number of clusters is $\Omega(\log(n))$, then this happens with small probability and
hence we should think of $m_0 n_0$ as the representative cluster size.

Lemma 1: If $t > (2 + \delta) \log_2(m)$, $\delta > 0$, then

$$P(\text{Rows in two different clusters are same}) < \frac{1}{m^{\delta}},$$

and a similar result holds for the column clusters.

Proof: Each row is uniformly distributed over $2^t$ possibilities and rows in different clusters are
independent. Hence the probability that any given pair of rows is same is $1/2^t$. Since there are $\binom{r}{2}$ pairs,
we then have

$$P(\text{Rows in two different clusters are same}) \le \frac{\binom{r}{2}}{2^t}.$$

Since $r \le m$, we have

$$P(\text{Rows in two different clusters are same}) < \frac{m^2}{2^t}.$$

Hence if $t > (2 + \delta) \log_2 m$ for some $\delta > 0$, then

$$P(\text{Rows in two different clusters are same}) < \frac{1}{m^{\delta}}.$$

■
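
As a quick numerical illustration of the union bound used in the proof, the following sketch evaluates $\binom{r}{2}/2^t$ and compares it with $1/m^{\delta}$. The function name `merge_probability_bound` and the sample values of m, r and $\delta$ are illustrative assumptions, not taken from the paper.

```python
import math

def merge_probability_bound(r: int, t: int) -> float:
    """Union bound C(r, 2) / 2**t on the probability that two of the r row
    clusters receive identical i.i.d. Bernoulli(1/2) patterns of length t."""
    return math.comb(r, 2) / 2.0 ** t

# Illustrative (assumed) values: m rows, r <= m row clusters.
m, r, delta = 10**4, 500, 0.5
t = math.ceil((2 + delta) * math.log2(m)) + 1   # ensures t > (2 + delta) * log2(m)
bound = merge_probability_bound(r, t)
print(bound, m ** (-delta))                     # the first value is the smaller one
```

Since $r \le m$, the bound can only improve when the number of row clusters is much smaller than m.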

Condition A3) also implies that in any row or column, for large matrices, the number of 0s
and 1s is roughly the same. This essentially implies that the opinions are diverse for any user or item. While this
may seem unrealistic (and can indeed be fixed), we prefer the Bernoulli(1/2) model for the following
reason: under this assumption no recommendations can be extracted from any row or column alone and
thus collaborative filtering is necessary. Such a model is desirable for evaluation of collaborative filtering
schemes. Moreover, one can pre-process data so that rows and columns with fraction of 1s far from 1/2
are removed (because they are relatively easy to recommend) and then assumption A3) is reasonable.
We note that in condition A3), we only specify the probability law of X given the clusters; the clusters 
are deterministic, even though they are unknown. 

The BSC in A4) models the inherent noise in user-item ratings as well as modeling error, while the 
erasure channel models the missing data. 
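
To make conditions A1)-A4) concrete, here is a minimal simulation sketch. It assumes contiguous clusters purely for readability (in the model the partitions are arbitrary and unknown), represents an erasure by `None`, and uses hypothetical helper names `block_constant_matrix` and `observe`.

```python
import random

def block_constant_matrix(row_sizes, col_sizes, rng):
    """Draw X with one Bernoulli(1/2) value per cluster A_i x B_j (A1, A3)."""
    values = [[rng.randint(0, 1) for _ in col_sizes] for _ in row_sizes]
    X = []
    for i, mi in enumerate(row_sizes):
        row = [values[i][j] for j, nj in enumerate(col_sizes) for _ in range(nj)]
        X.extend(list(row) for _ in range(mi))
    return X

def observe(X, p, e, rng):
    """Pass each entry through a BSC(p), then erase it with probability e (A4)."""
    Y = []
    for row in X:
        out = []
        for x in row:
            y = x ^ 1 if rng.random() < p else x      # BSC flip
            out.append(None if rng.random() < e else y)  # erasure
        Y.append(out)
    return Y

rng = random.Random(0)
X = block_constant_matrix([3, 3], [4, 4], rng)   # m0 = 3, n0 = 4, r = t = 2
Y = observe(X, p=0.25, e=0.9, rng=rng)
```

With e = 0.9 most entries of `Y` are erased, which is the regime used in the paper's numerical example.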

Conditions B1)-B3): These conditions are more general, allowing any finite alphabet and non-uniform
cluster sizes.

B1) The entries of X are from a finite alphabet A.

B2) X is constant over the cluster $A_i \times B_j$ and the entries across the clusters are i.i.d. with a uniform
distribution over A.

B3) The observed data Y, with entries in $A \cup \{e\}$ (e denotes erasure), is obtained from X as follows.

a) The entries of X are passed through a DMC with probability law $q(\cdot|\cdot)$ and output alphabet
A, resulting in $\tilde{X}$.

b) The entries $\tilde{X}_{ij}$ are then passed through an erasure channel with erasure probability e.
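
The two-stage observation process in B3) can be sketched as follows. The nested-dictionary representation of the transition law and the names `dmc_then_erase` and `q_example` are assumptions made for illustration; the ternary channel shown is an arbitrary example, not one from the paper.

```python
import random

def dmc_then_erase(x, q, e, rng):
    """Pass symbol x through a DMC with transition law q[x][y], then erase the
    output with probability e. An erasure is represented by None."""
    symbols = list(q[x].keys())
    weights = [q[x][y] for y in symbols]
    x_tilde = rng.choices(symbols, weights=weights, k=1)[0]   # step a)
    return None if rng.random() < e else x_tilde              # step b)

# Illustrative ternary channel: keep the symbol w.p. 0.8, else move to another.
alphabet = "abc"
q_example = {a: {b: 0.8 if a == b else 0.1 for b in alphabet} for a in alphabet}
rng = random.Random(1)
sample = [dmc_then_erase("a", q_example, 0.5, rng) for _ in range(10)]
```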
III. Binary Rating Matrix

In this section, we state our results under conditions A1)-A4). The main result of this section appears
in Section III-A. It is obtained by studying two quantities: probability of error when the clustering is
known (Section III-B) and probability of error in clustering for a specific algorithm (Section III-C).

A. Main Result 

Our main result stated below identifies a threshold on the cluster size above which error free recovery 
is asymptotically feasible but below which error free recovery is not possible. 

Theorem 1: Suppose conditions A1)-A4) are true and the clusters are unknown. Let $p_1 = e + 2(1 - e)\sqrt{p(1 - p)}$. Suppose that $e < 1$ and $p \in [0, p_0]$, $p_0 < 1/2$.

1) Converse: If

$$m_0 n_0 < (1 - \delta)\frac{\ln(mn)}{\ln(1/p_1)}, \quad \delta > 0,$$

then $P_e \to 1$ for any estimator.

2) Achievability: If $t = \Omega(\log(n))$, $r = \Omega(\log(m))$, $\limsup m/n < \infty$, $\limsup n/m < \infty$ and

$$m_0 n_0 > \frac{\ln(mn)}{\ln(1/p_1)},$$

then $P_e \to 0$ for the following polynomial time estimator:

• Cluster rows and columns using the algorithm of Section III-C using the threshold $d_0 \in (2p_0(1 - p_0), 1/2)$ (which does not depend on $e, m_0, n_0$).

• Employ majority decoding in a cluster (as in Section III-B) assuming the clustering to be correct.

Proof: The proof is given in Section VI-A. ■

The result identifies $\ln(mn)/\ln(1/p_1)$ as the cluster size threshold. The first part states that if the
cluster size is too small, then any estimator makes an error with high probability. The second part states
that if the cluster size is large enough, then diminishing probability of error can be achieved with a
polynomial time estimator, which does not need knowledge of $e, m_0, n_0$ and needs only knowledge of
a worst case bound on p. The result is reminiscent of the channel coding theorem in the context of our
model.

The proof of Part 1) of Theorem 1 relies on lower bounding $P_e$ by considering the case of known
clustering (see Theorem 2 in Section III-B). The proof of Part 2) of Theorem 1 relies on showing that
for the average case, the probability of error in clustering is much smaller than the probability of error in
filling values when the clusters are known (see Theorem 3 in Section III-C). We illustrate this in Figure
1 by plotting various bounds: for $m = n = 10^6$, $m_0 = n_0$ ranging from 10 to 150, p = 0.25 and e = 0.9,
we plot

• upper and lower bounds for probability of error when clustering is known (from Theorem 2),

• upper bound on probability of clustering error (from Theorem 3),

• and the asymptotic threshold $\ln(mn)/\ln(1/p_1)$ (from Theorem 1).

It is seen that around the asymptotic threshold, the probability of clustering error is dominated by the
probability of error in filling values under known clustering.



B. Known Clustering 

In this section, we consider the case when the clusters are known. Under this assumption, the decoder 
only has to estimate the value in a cluster, and the minimum probability of error estimator under A3) is 
just a majority decoder. The analysis of this decoder is elementary and we state a stronger result for a 
fixed X with possibly unequal cluster sizes. Let 

$$s_*(X) := \min_{i,j} m_i(X) n_j(X), \qquad s^*(X) := \max_{i,j} m_i(X) n_j(X),$$

where $\{m_i(X)\}$ and $\{n_j(X)\}$ are the row and column cluster sizes in X.
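
The majority decoder for known clusters can be sketched as below. This is only an illustration under stated assumptions: clusters are passed as lists of row and column indices, erasures are `None`, and ties (including blocks with no observed entry) are resolved as 0 for simplicity, whereas the analysis in the proofs breaks ties with a fair coin. The name `majority_fill` is hypothetical.

```python
def majority_fill(Y, row_clusters, col_clusters):
    """Estimate X from binary observations Y when the clusters are known:
    each block A_i x B_j is filled with the majority of its non-erased entries."""
    m, n = len(Y), len(Y[0])
    X_hat = [[0] * n for _ in range(m)]
    for rows in row_clusters:
        for cols in col_clusters:
            obs = [Y[i][k] for i in rows for k in cols if Y[i][k] is not None]
            bit = 1 if 2 * sum(obs) > len(obs) else 0   # ties resolved as 0 (assumption)
            for i in rows:
                for k in cols:
                    X_hat[i][k] = bit
    return X_hat

Y_demo = [[1, None, 0, 0],
          [1, 1, None, 0]]
X_demo = majority_fill(Y_demo, row_clusters=[[0, 1]], col_clusters=[[0, 1], [2, 3]])
```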

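Theorem 2: Suppose conditions A1), A3) are true and in addition assume that the clusters are known.
Let

$$p_1 = e + 2(1 - e)\sqrt{p(1 - p)}.$$

Then the probability of error in filling in values satisfies

$$1 - \exp\left(-\frac{mn}{4}\sqrt{\frac{p}{1 - p}}\,\frac{p_1^{s^*(X)}}{s^*(X)\left(s^*(X) + 1\right)}\right) \le P_{e|A,B}(X) \le 1 - \exp\left(-2\ln(2)\,\frac{mn}{s_*(X)}\,p_1^{s_*(X)}\right), \tag{1}$$

where the upper bound holds for $s_*(X) \ge \ln(2)/\ln(1/p_1)$.

Suppose we are given a sequence of rating matrices of increasing size, that is, $mn \to \infty$. Then the
following are true.

1) If

$$s_*(X) \ge \frac{\ln(mn)}{\ln(1/p_1)},$$

then $P_{e|A,B}(X) \to 0$.

2) If

$$s^*(X) \le \frac{(1 - \delta)\ln(mn)}{\ln(1/p_1)} \quad \text{for some } \delta > 0,$$

then $P_{e|A,B}(X) \to 1$.

Proof: The proof is given in Section VI-B. ■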



We note that when all the clusters are of the same size (which happens with high probability as per
Lemma 1), then the above result states that there is a sharp threshold: if the cluster size is smaller than
$\ln(mn)/\ln(1/p_1)$, then exact recovery is not possible, but if it is larger, then we can make probability
of error as small as we wish.

Example: For $m = n = 10^6$, $m_i = n_j = n_0$, e = 0.9, p = 0.25, this threshold corresponds to clusters
of size about 45 x 45 = 2025. We plot the lower and upper bounds for $P_{e|A,B}(X)$ from Theorem 2 and
the threshold in Figure 1.
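
The threshold in this example can be checked with a few lines of arithmetic. The sketch below assumes $m = n = 10^6$ (the reading of the garbled exponent that is consistent with "about 45 x 45") and uses the hypothetical helper name `cluster_size_threshold`.

```python
import math

def cluster_size_threshold(m, n, p, e):
    """ln(mn) / ln(1/p1) with p1 = e + 2 (1 - e) sqrt(p (1 - p))."""
    p1 = e + 2 * (1 - e) * math.sqrt(p * (1 - p))
    return math.log(m * n) / math.log(1 / p1)

threshold = cluster_size_threshold(10**6, 10**6, p=0.25, e=0.9)
side = math.sqrt(threshold)   # side length of a square cluster at the threshold
print(threshold, side)        # side is close to 45
```

Lowering the erasure probability e or the crossover probability p decreases $p_1$ and therefore the threshold.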

Remark: A finer analysis reveals that we can refine Part 2) of Theorem 2 (and hence also Part 1)
of Theorem 1) by letting $\delta$ approach zero as $m, n \to \infty$. The result holds as long as $\delta_{m,n} \ln(mn) - 2 \ln\ln(mn) \to \infty$.

C. Probability of Clustering Error 

To get an upper bound on the probability of error $P_e$, in this section we analyze a specific collaborative
filter: we first cluster the rows and columns using the algorithm described below and then we fill in values
using the majority decoder assuming that the clustering is correct. The majority decoder has already been
analyzed in Section III-B and for proving Part 2) of Theorem 1 we only need to analyze the probability
of error in clustering.

Clustering Algorithm: We cluster rows and columns separately. For rows i, j, the normalized Hamming
distance over commonly sampled entries is

$$d_{ij} = \frac{1}{N_{ij}} \sum_{k=1}^{n} 1(Y_{ik} \ne e,\, Y_{jk} \ne e,\, Y_{ik} \ne Y_{jk}),$$

where $N_{ij}$ is the number of commonly sampled positions in rows i and j, given by

$$N_{ij} = \sum_{k=1}^{n} 1(Y_{ik} \ne e,\, Y_{jk} \ne e).$$

Let $I_{ij}$ be equal to 1 if rows i, j belong to the same cluster and let it be 0 otherwise. The algorithm
gives an estimate:

$$\hat{I}_{ij} = \begin{cases} 1, & d_{ij} < d_0 \\ 0, & d_{ij} \ge d_0 \end{cases}$$

where $d_0$ is a threshold whose choice will be discussed later. A similar algorithm is used to cluster columns.
We are interested in the probability that we make an error in row clustering averaged over the probability
law on the rating matrices defined as

$$P_{e,rc} := \Pr(\hat{I}_{ij} \ne I_{ij} \text{ for some } i, j).$$

We note that this is a conservative definition of clustering error. As seen in Lemma 1, there is a small
chance that rows in different clusters may be the same, resulting in the merging of two clusters into a
larger one. The above definition of error does not account for this and declares more errors. We use this
conservative definition of clustering error to simplify analysis.

Theorem 3: Suppose conditions A1)-A4) are true. Let $r_1 > 1$, $r_2 \in (0, 1)$ be constants and let $h^*$ be
the smaller root of the quadratic equation

$$2\mu\nu(1 - d_0)h^2 + (2d_0 - 2\mu\nu - 1)h + 1 - 2d_0 = 0, \tag{2}$$

where $\mu := 2p(1 - p)$, $\nu = 1 - \mu$. Suppose the threshold $d_0 \in (\mu, 1/2)$. Let

$$a_1 = D\left(r_1 (1 - e)^2 \,\|\, (1 - e)^2\right), \qquad a_2 = D\left(r_2 (1 - e)^2 \,\|\, (1 - e)^2\right).$$

Then for the above clustering algorithm,




where 




(3) 



August 18, 2009 



DRAFT 






and

$$\Pr(\hat{I}_{ij} = 1 \mid I_{ij} = 0) \le \min\{P_1, P_2\}, \tag{4}$$



Pi 




Ai(no)* + exp(-ain) + A2(no)* 



(5) 



P2 = exp (-03*) 



(6) 



for a positive constant 03. 



Proof: The proof is given in Section VI-C. ■
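
The root $h^*$ in Theorem 3 is easy to compute numerically. The sketch below assumes that the quadratic is $2\mu\nu(1 - d_0)h^2 + (2d_0 - 2\mu\nu - 1)h + 1 - 2d_0 = 0$, which is what one obtains by minimizing the average-case Chernoff factor $\sqrt{(1 - \nu h)(1 - \mu h)}/(1 - h)^{d_0}$ over h. The function names are illustrative, and p > 0 is assumed so that the leading coefficient is nonzero.

```python
import math

def quadratic_coeffs(p, d0):
    mu = 2 * p * (1 - p)
    nu = 1 - mu
    return 2 * mu * nu * (1 - d0), 2 * d0 - 2 * mu * nu - 1, 1 - 2 * d0

def optimal_h(p, d0):
    """Smaller root of a h^2 + b h + c = 0 (a > 0 when 0 < p < 1/2, d0 < 1)."""
    a, b, c = quadratic_coeffs(p, d0)
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

def per_sample_factor(p, d0, h):
    """Average-case (S_ij = N_ij / 2) Chernoff factor per commonly sampled column."""
    mu = 2 * p * (1 - p)
    nu = 1 - mu
    return math.sqrt((1 - nu * h) * (1 - mu * h)) / (1 - h) ** d0

p = 0.25
d0 = (2 * p * (1 - p) + 1) / 3       # threshold used for Figure 1
h_star = optimal_h(p, d0)
factor = per_sample_factor(p, d0, h_star)
```

A factor strictly below one means that the pairwise bound decays exponentially in the number of commonly sampled columns.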



The proof uses the union bound and considers pairwise errors. The pairwise errors consist of two
cases: error when the pair of rows is in the same cluster and error when they are in different clusters. The
probability of the first kind of error is exponentially decaying in n. The probability of the second kind
of error is upper bounded by the minimum of $P_1$ and $P_2$: while $P_1$ is tight for finite n and large p, e,
the bound $P_2$ is useful for establishing asymptotic results (like Theorem 1) for all p, e. For example, in
Figure 1 the upper bound on clustering error is dominated by $P_1$, while the proof of Part 2) of Theorem
1 uses $P_2$. We note that both $P_1$ and $P_2$ have terms that decay exponentially in n as well as t. The terms
decaying exponentially in t are related to Lemma 1 and the conservative definition of clustering error as
discussed before the statement of Theorem 3. These terms are the origin of the $t = \Omega(\log(n))$ condition
in Part 2) of Theorem 1 and can perhaps be avoided with more sophisticated analysis; however, we prefer
to work with this condition since as per Lemma 1, the condition $t = \Omega(\log(n))$ is anyway needed for
interpreting $m_0 n_0$ as the representative cluster size.
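
The pairwise clustering rule of this section (normalized Hamming distance over commonly sampled entries, compared against $d_0$) can be sketched as follows. Erasures are represented by `None`; declaring a pair "different" when no column is sampled in both rows is an assumption of this sketch, since that case is not specified in the text. Function names are illustrative.

```python
def normalized_hamming(row_i, row_j):
    """d_ij over commonly sampled positions, or None if N_ij = 0."""
    common = [(a, b) for a, b in zip(row_i, row_j) if a is not None and b is not None]
    if not common:
        return None
    return sum(a != b for a, b in common) / len(common)

def same_cluster_estimates(Y, d0):
    """Matrix of pairwise estimates: 1 if d_ij < d0, else 0."""
    m = len(Y)
    I_hat = [[1 if i == j else 0 for j in range(m)] for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = normalized_hamming(Y[i], Y[j])
            est = 1 if d is not None and d < d0 else 0   # N_ij = 0 treated as "different"
            I_hat[i][j] = I_hat[j][i] = est
    return I_hat

Y_rows = [[0, 1, 0, 1, None],
          [0, 1, 0, None, 1],
          [1, 0, 1, 0, 0]]
I_hat = same_cluster_estimates(Y_rows, d0=0.45)
```

The same routine applied to the transpose of Y gives the pairwise estimates for columns.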

IV. General Finite Alphabet and Non-uniform Clusters 

In this section, we consider a general finite alphabet A and non-uniform cluster sizes. We work with
assumptions B1)-B3) described in Section II and generalize the results in Section III. To state our
results, we first introduce some notation. For $g \in A$, define



If $A_1, A_2$ are i.i.d. uniform on A and we pass them through the DMC $q(\cdot|\cdot)$ to get outputs $\tilde{A}_1, \tilde{A}_2$, then




(7) 





P,(JGA 
The following useful lemma sheds light on the relationship between $d_{lb}$ and $d_{ub}$.

Lemma 2: For any DMC, $d_{ub} \ge d_{lb}$, with equality iff $q(y|p) = q(y|q)$ $\forall\, p, q, y \in A$.

Proof: The proof is given in Section VI-E. ■

We next state our main result for general finite alphabet and non-uniform cluster size. 
Theorem 4: Suppose conditions B1)-B3) are true and the clusters are unknown. Then there exist
constants $p_1, p_2 \in (0, 1)$, $p_1 > p_2$ such that

1) Converse: If

$$\max_{i,j} m_i n_j < (1 - \delta)\frac{\ln(mn)}{\ln(1/p_2)}, \quad \delta > 0,$$

then $P_e \to 1$ for any estimator.

2) Achievability: Suppose that there exist some $y, p, q \in A$ such that $p \ne q$ and $q(y|p) \ne q(y|q)$. (By
Lemma 2, this ensures that $d_{lb} < d_{ub}$.) If $n^2/(n_1^2 + n_2^2 + \ldots + n_t^2) = \Omega(\log(m))$, $m^2/(m_1^2 + m_2^2 + \ldots + m_r^2) = \Omega(\log(n))$, $\limsup m/n < \infty$, $\limsup n/m < \infty$ and

$$\min_{i,j} m_i n_j > \frac{\ln(mn)}{\ln(1/p_1)},$$

then $P_e \to 0$ for the following polynomial time estimator:

• Cluster rows and columns using the algorithm of Section III-C using the threshold $d_0 \in (d_{lb}, d_{ub})$ (which does not depend on $e, m_i, n_j$).

• Employ maximum likelihood decoding in a cluster assuming the clustering is correct.

Proof: The proof is similar to that of Theorem 1; we now use Theorems 5 and 6 in place of Theorems 2
and 3 respectively. ■

The above result again identifies $\ln(mn)$ as the exact order of the cluster size threshold for asymptotic
recovery. Similar to the binary alphabet and uniform cluster size case in Section III, the constants $p_1, p_2$
arise from the case when the clusters are known (see Theorem 5 below). The gap between the constants
$p_1, p_2$ can be made arbitrarily small: the proof of Theorem 5 identifies a constant $C_1$ (see equation (29))
such that for any $\delta > 0$,

$$p_1 = e + (1 - e) \exp(-C_1 + \delta), \qquad p_2 = e + (1 - e) \exp(-C_1 - \delta)$$

is a valid choice in Theorem 4.

We next consider the case when the clusters are known and extend Theorem 2.

Theorem 5: Suppose conditions B1)-B3) are true and in addition assume that the clusters are known.

Also let 

, , ln(l/2|A|) 
where $p_1, p_2$ are as defined above. Then for a sequence of rating matrices of increasing size $mn \to \infty$,
the following are true.



1) If 



, , In(mn) 



In(lM)' 



then $P_{e|A,B}(X) \to 0$.
2) If

then $P_{e|A,B}(X) \to 1$.

Proof: The proof is given in Section VI-D. ■

Finally, we study the performance of the clustering algorithm and extend Theorem 3.
Theorem 6: Suppose conditions B1)-B3) are true and in addition suppose that there exist some $y, p, q \in A$
such that $p \ne q$ and $q(y|p) \ne q(y|q)$. (By Lemma 2, this ensures that $d_{lb} < d_{ub}$.) If we choose the
threshold $d_0 \in (d_{lb}, d_{ub})$, then

$$P_{e,rc} < c'\, m^2 \exp\left(-c\, n^2/(n_1^2 + n_2^2 + \ldots + n_t^2)\right), \tag{9}$$

for some positive constants c, c'. Consequently, if $n^2/(n_1^2 + n_2^2 + \ldots + n_t^2) = \Omega(\log(m))$, then $P_{e,rc} \to 0$
as $m, n \to \infty$.

Proof: The proof is given in Section VI-F. ■



V. Conclusion 

We take a channel coding perspective of collaborative filtering and identify the threshold on cluster 
size for perfect reconstruction of the underlying rating matrix. The result is similar in flavor to some 
recent results in completion of real-valued matrices. The advantage of our model is that the proofs are
relatively simple, relying on Chernoff bounds, and noisy user behavior can be easily handled.

In the typical applications of recommendation systems, there is a lack of good models. We believe that 
our model has two characteristics that make it suitable for analytical comparison of various methods: 
a) in our model the user opinions are diverse and no single user/item reveals much information about 
itself, that is, collaborative filtering is necessary; b) as we have shown, the model is analytically tractable. 
There are several directions where this model may turn out to be useful: analysis of bit error probability 
instead of block error probability, analysis of local popularity based mechanisms, etc. 
VI. Proofs of Results

A. Proof of Theorem 1

The proof is based on Theorems 2 and 3.

When A, B are known, under our model all feasible rating matrices are equally likely. Hence the ML
decoder gives the minimum probability of error and so we have $P_e \ge E[P_{e|A,B}(X)]$. To prove Part 1),
we lower bound $E[P_{e|A,B}(X)]$. Let T be the event that $s^*(X) > m_0 n_0$. Proceeding as in Lemma 1, we
have for $t > (2 + \delta) \log_2(n)$, $r > (2 + \delta) \log_2(m)$, $\delta > 0$,

$$\Pr(T) < \frac{1}{m^{\delta}} + \frac{1}{n^{\delta}}.$$

Hence $\Pr(T) \to 0$. Now,

$$E[P_{e|A,B}(X)] \ge E[P_{e|A,B}(X); T^c].$$

But on the event $T^c$, $s^*(X) = m_0 n_0$ and hence we get

$$P_e \ge E[P_{e|A,B}(X)] \ge (1 - \Pr(T)) P_{e|A,B}(X).$$

But from Part 2) of Theorem 2, $P_{e|A,B}(X) \to 1$ if

$$m_0 n_0 < (1 - \delta)\frac{\ln(mn)}{\ln(1/p_1)}, \quad \delta > 0.$$

This proves Part 1).

Next we prove Part 2). Let D denote the event that the clustering is identified correctly. We note that
the probability of error in estimating X averaged over the probability law on the block constant matrices
satisfies

$$P_e \le E[P_{e|A,B}(X) \Pr(D) + \Pr(D^c)] \le E[P_{e|A,B}(X)] + (P_{e,rc} + P_{e,cc}),$$

where $P_{e,cc}$ is the probability of error in column clustering. The desired result follows from Part 1) of
Theorem 2 and Theorem 3. ■

B. Proof of Theorem 2

Suppose in cluster $A_i \times B_j$ we have s non erased samples. Then the probability of correct decision in
this cluster is given by

$$\Pr(E^c_{ij,s}) = \begin{cases} \sum_{q=0}^{\lfloor s/2 \rfloor} \binom{s}{q} p^q (1 - p)^{s-q} & \text{if } s \text{ is odd}, \\ \sum_{q=0}^{s/2 - 1} \binom{s}{q} p^q (1 - p)^{s-q} + \frac{1}{2}\binom{s}{s/2} p^{s/2} (1 - p)^{s/2} & \text{if } s \text{ is even}. \end{cases} \tag{11}$$

Averaging over the number of non erased samples, the probability of correct decision in cluster $A_i \times B_j$
is given by

$$\Pr(E^c_{ij}) = \sum_{s=0}^{m_i n_j} \binom{m_i n_j}{s} (1 - e)^s e^{m_i n_j - s} \Pr(E^c_{ij,s}). \tag{12}$$

Since the erasure and BSC are memoryless,

$$P_{e|A,B}(X) = 1 - \prod_{i=1}^{r}\prod_{j=1}^{t} \Pr(E^c_{ij}). \tag{13}$$

Equations (11), (12), and (13) specify the probability of error.
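
The exact expressions can be evaluated directly for small blocks. The following sketch assumes the reading of (11) and (12) in which a majority vote is taken over the s non-erased samples, ties for even s are broken by a fair coin, and the number of non-erased samples in a block of N entries is Binomial(N, 1 - e). Function names are illustrative.

```python
import math

def p_correct_given_s(s, p):
    """Probability that a majority vote over s BSC(p) outputs is correct,
    with a fair coin flip on ties (s even, including s = 0)."""
    total = sum(math.comb(s, q) * p**q * (1 - p)**(s - q) for q in range((s + 1) // 2))
    if s % 2 == 0:
        total += 0.5 * math.comb(s, s // 2) * (p * (1 - p)) ** (s // 2)
    return total

def p_correct_block(N, p, e):
    """Average of p_correct_given_s over s ~ Binomial(N, 1 - e)."""
    return sum(math.comb(N, s) * (1 - e)**s * e**(N - s) * p_correct_given_s(s, p)
               for s in range(N + 1))

p, e, N = 0.25, 0.9, 20
p1 = e + 2 * (1 - e) * math.sqrt(p * (1 - p))
block_error = 1 - p_correct_block(N, p, e)
print(block_error, p1 ** N)   # exact block error versus the bound p1^N
```

The per-block error never exceeds $p_1^N$, which is the inequality the upper-bound argument relies on.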

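Upper Bound: The desired upper bound is obtained by deriving a lower bound on $\Pr(E^c_{ij,s})$. First we
note that from (11),

$$\Pr(E^c_{ij,s}) \ge 1 - \sum_{q=\lceil s/2 \rceil}^{s} \binom{s}{q} p^q (1 - p)^{s-q}.$$

But for $0 \le p \le 1/2$ and $q \ge s/2$, $p^q (1 - p)^{s-q} \le (p(1 - p))^{s/2}$. Substituting this in the previous equation, we
have

$$\Pr(E^c_{ij,s}) \ge 1 - \left(2\sqrt{p(1 - p)}\right)^s. \tag{14}$$

From Equations (12) and (14), we have $\Pr(E^c_{ij}) \ge 1 - p_1^{m_i n_j}$ and so from (13),

$$P_{e|A,B}(X) \le 1 - \prod_{i=1}^{r}\prod_{j=1}^{t}\left(1 - p_1^{m_i n_j}\right).$$

We note that for $x \in [0, 1/2]$, $1 - x \ge \exp(-2\ln(2)x)$. Hence

$$\exp\left(-2\ln(2)\sum_{i=1}^{r}\sum_{j=1}^{t} p_1^{m_i n_j}\right) \le \prod_{i=1}^{r}\prod_{j=1}^{t}\left(1 - p_1^{m_i n_j}\right),$$

where the inequality holds for $p_1^{m_i n_j} \le 1/2$. This is true since $s_*(X) \ge \ln(2)/\ln(1/p_1)$. The upper
bound follows by noting that

$$\sum_{i=1}^{r}\sum_{j=1}^{t} p_1^{m_i n_j} \le rt\, p_1^{s_*(X)} \le \frac{mn}{s_*(X)}\, p_1^{s_*(X)}.$$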

Lower Bound: The lower bound on $P_{e|A,B}(X)$ is obtained from an upper bound on $\Pr(E^c_{ij})$. From (11):



If s is even, we have 



For s odd, 



and so 



-2{s + l) P P) 



1 



l-Pr(i^f,,-J>^^j^(2vMr^) • (15) 



h{\s/2]/s) = /i(l/2 + l/2s) > 1 - 1/s 



2 



From ([TSll and ([T6l ). we have for all s. 



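Now from (12),

$$\Pr(E^c_{ij}) \le 1 - \frac{1}{4(m_i n_j + 1)}\sqrt{\frac{p}{1 - p}}\; p_1^{m_i n_j}.$$

Using this bound on $\Pr(E^c_{ij})$ in (13), we have

$$P_{e|A,B}(X) \ge 1 - \prod_{i=1}^{r}\prod_{j=1}^{t}\left(1 - \frac{1}{4}\sqrt{\frac{p}{1 - p}}\,\frac{p_1^{m_i n_j}}{m_i n_j + 1}\right)$$

$$\ge 1 - \exp\left(-\frac{1}{4}\sqrt{\frac{p}{1 - p}}\,\sum_{i=1}^{r}\sum_{j=1}^{t}\frac{p_1^{m_i n_j}}{m_i n_j + 1}\right) \tag{18}$$

$$\ge 1 - \exp\left(-\frac{1}{4}\sqrt{\frac{p}{1 - p}}\; rt\,\frac{p_1^{s^*(X)}}{s^*(X) + 1}\right)$$

$$\ge 1 - \exp\left(-\frac{1}{4}\sqrt{\frac{p}{1 - p}}\,\frac{mn}{s^*(X)}\,\frac{p_1^{s^*(X)}}{s^*(X) + 1}\right),$$

where in (18) we have used $1 - x \le \exp(-x)$. This completes the proof of (1).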

Asymptotics: Now consider a sequence of rating matrices of increasing size. The upper bound on error
in (1) is a decreasing function of $s_*(X)$. Hence if

$$s_*(X) \ge \frac{\ln(mn)}{\ln(1/p_1)},$$

then

$$P_{e|A,B}(X) \le \frac{2\ln(2)\ln(1/p_1)}{\ln(mn)} \to 0.$$

Now suppose

$$s^*(X) \le \frac{(1 - \delta)\ln(mn)}{\ln(1/p_1)}, \quad \text{for some } \delta > 0.$$

The lower bound on error in (1) is a decreasing function of $s^*(X)$, and hence substituting the above upper
bound on $s^*(X)$, we have

$$P_{e|A,B}(X) \ge 1 - \exp\left(-\frac{c_1 (mn)^{\delta}}{(c_2 + \ln(mn))\ln(mn)}\right),$$

where $c_1, c_2$ are some positive constants. Hence $P_{e|A,B}(X) \to 1$ as $mn \to \infty$. ■
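
To see how the two bounds behave around the threshold, they can be evaluated for uniform clusters. The sketch below assumes that the bounds take the form $1 - \exp(-\tfrac{mn}{4}\sqrt{p/(1-p)}\, p_1^{s}/(s(s+1)))$ (lower) and $1 - \exp(-2\ln(2)\tfrac{mn}{s}p_1^{s})$ (upper) with $s = m_0 n_0$, as suggested by the steps of this proof; the function name `theorem2_bounds` and the parameter values ($m = n = 10^6$, p = 0.25, e = 0.9) are assumptions for illustration.

```python
import math

def theorem2_bounds(m, n, m0, n0, p, e):
    """(lower, upper) bounds on the error probability for uniform clusters,
    under the assumed form of the bounds described above."""
    p1 = e + 2 * (1 - e) * math.sqrt(p * (1 - p))
    s = m0 * n0                                   # s_* = s^* = m0 * n0
    x_upper = 2 * math.log(2) * (m * n / s) * p1**s
    x_lower = (m * n / 4) * math.sqrt(p / (1 - p)) * p1**s / (s * (s + 1))
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x
    return -math.expm1(-x_lower), -math.expm1(-x_upper)

for m0 in (10, 30, 45, 60, 100, 150):
    lower, upper = theorem2_bounds(10**6, 10**6, m0, m0, p=0.25, e=0.9)
    print(m0, lower, upper)
```

Both bounds fall from values near one to values near zero as $m_0$ passes through the neighbourhood of 45.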

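C. Proof of Theorem 3

Recall that $N_{ij}$ is the number of commonly sampled positions in rows i and j, given by

$$N_{ij} = \sum_{k=1}^{n} 1(Y_{ik} \ne e,\, Y_{jk} \ne e).$$

From the Chernoff bound [10, Theorem 1], we have

$$\Pr\left(N_{ij} \ge n r_1 (1 - e)^2\right) \le \exp\left(-n D\left(r_1 (1 - e)^2 \,\|\, (1 - e)^2\right)\right) = \exp(-n a_1), \text{ and} \tag{19}$$

$$\Pr\left(N_{ij} \le n r_2 (1 - e)^2\right) \le \exp\left(-n D\left(r_2 (1 - e)^2 \,\|\, (1 - e)^2\right)\right) = \exp(-n a_2). \tag{20}$$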
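
The exponents $a_1$ and $a_2$ are KL divergences between Bernoulli distributions and are simple to evaluate. The sketch below is illustrative only: the helper names and the values n = 1000, $r_1 = 2$, $r_2 = 0.5$ are assumptions, and $r_1 (1 - e)^2 \le 1$ is required for the first divergence to be defined.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence D(a || b) in nats between Bernoulli(a) and Bernoulli(b)."""
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def common_sample_tail_bounds(n, e, r1, r2):
    """Chernoff bounds exp(-n a1), exp(-n a2) on the two tails of N_ij."""
    q = (1 - e) ** 2                 # probability that a column is sampled in both rows
    a1 = kl_bernoulli(r1 * q, q)
    a2 = kl_bernoulli(r2 * q, q)
    return math.exp(-n * a1), math.exp(-n * a2)

upper_tail, lower_tail = common_sample_tail_bounds(n=1000, e=0.9, r1=2.0, r2=0.5)
```

Both bounds shrink exponentially in n, so for large matrices $N_{ij}$ is close to $n(1 - e)^2$ with high probability.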

To get a handle on the probability of error, we first analyze it conditioned on the erasure sequence and 
X. Let E denote the erasure matrix: 

^=[i(y,, = e)U„G{o,ir-. 

Rows in Same Cluster: Consider rows $i, j$ of $X$ and suppose $I_{ij} = 1$, i.e. $i, j$ are in the same cluster.
We wish to evaluate the probability of error $\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X)$. In this case, the random variable
$N_{ij} d_{ij}$ is given by

$$N_{ij} d_{ij} = \sum_{k: Y_{ik} \ne e, Y_{jk} \ne e} \mathbf{1}(Y_{ik} \ne Y_{jk}).$$

For any column $k$ such that $Y_{ik} \ne e$, $Y_{jk} \ne e$, the indicator $\mathbf{1}(Y_{ik} \ne Y_{jk})$ has mean $\mu = 2p(1-p)$.
Hence, the above summation has $N_{ij}$ i.i.d. Bernoulli random variables of mean $\mu$. An application of the
Chernoff bound [10, Theorem 1] yields

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X) \le \exp(-N_{ij} D(d_0 \| \mu)). \tag{21}$$
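The bound in (21) can be sanity-checked by simulation. The sketch below draws pairs of observations of the same underlying entry through independent binary symmetric channels with crossover probability `p` and compares the empirical frequency of $d_{ij} > d_0$ with $\exp(-N_{ij} D(d_0\|\mu))$. The parameter values and function names are hypothetical, chosen only for this illustration.

```python
import math
import random


def kl_bernoulli(a, b):
    """D(a || b) in nats between Bernoulli(a) and Bernoulli(b), with 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


def chernoff_upper_tail(n_ij, d0, mu):
    """Right-hand side of (21): exp(-N_ij D(d0 || mu))."""
    return math.exp(-n_ij * kl_bernoulli(d0, mu))


def empirical_tail(n_ij, d0, p, trials, rng):
    """Empirical frequency of d_ij > d0 for two rows in the same cluster."""
    hits = 0
    for _ in range(trials):
        # Two independent noisy copies of the same entry disagree when exactly one flips.
        disagreements = sum(
            (rng.random() < p) != (rng.random() < p) for _ in range(n_ij)
        )
        if disagreements / n_ij > d0:
            hits += 1
    return hits / trials


p = 0.1                  # hypothetical crossover probability
mu = 2 * p * (1 - p)     # mean of each disagreement indicator
n_ij, d0 = 50, 0.3       # hypothetical number of common samples and threshold
rng = random.Random(0)

bound = chernoff_upper_tail(n_ij, d0, mu)
freq = empirical_tail(n_ij, d0, p, trials=5000, rng=rng)
print(f"empirical = {freq:.4f}, Chernoff bound = {bound:.4f}")
```

The empirical frequency should sit well below the bound, since the Chernoff bound is exponentially tight but not sharp in its constant.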



The bound is independent of $X$. We only need to take the average of (21) with respect to $E$. Using (20),
we have

$$\Pr(d_{ij} > d_0 | I_{ij} = 1) \le \exp\left(-n (1-\epsilon)^2 r_2 D(d_0 \| \mu)\right) + \exp(-n a_2)$$
$$\le 2 \exp\left(-n \min\{ r_2 (1-\epsilon)^2 D(d_0 \| \mu), a_2 \}\right). \tag{22}$$

Rows in Different Clusters: Next consider the case $I_{ij} = 0$, i.e. rows $i$ and $j$ are in different clusters.
We wish to evaluate $\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X)$. For $I_{ij} = 0$ and fixed $E$, $X$, the random variable $N_{ij} d_{ij}$
is given by

$$N_{ij} d_{ij} = \sum_{k: X_{ik} = X_{jk}} \mathbf{1}(Y_{ik} \ne Y_{jk}) + \sum_{k: X_{ik} \ne X_{jk}} \mathbf{1}(Y_{ik} \ne Y_{jk}). \tag{23}$$

Note that for any column $k$ such that $Y_{ik} \ne e$, $Y_{jk} \ne e$, the indicator $\mathbf{1}(Y_{ik} \ne Y_{jk})$ has mean
• $2p(1-p) = \mu$ if $X_{ik} = X_{jk}$, and
• $p^2 + (1-p)^2 = \nu$ if $X_{ik} \ne X_{jk}$.

Define $s_{ij}$ as the number of columns $k$ such that $Y_{ik} \ne e$, $Y_{jk} \ne e$ and $X_{ik} \ne X_{jk}$. Then from (23), we
observe that the first sum in (23) has $N_{ij} - s_{ij}$ i.i.d. Bernoulli random variables of mean $\mu$ and the second
sum has $s_{ij}$ i.i.d. Bernoulli random variables of mean $\nu$, all the random variables being independent.
Using the Chernoff bound, we may then write



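$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \frac{(1 - \nu + \nu e^{\theta})^{s_{ij}} (1 - \mu + \mu e^{\theta})^{N_{ij} - s_{ij}}}{e^{\theta N_{ij} d_0}}, \quad \text{for } \theta \le 0. \tag{24}$$

By substituting $h = 1 - \exp(\theta)$, we can rewrite the above bound as

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \frac{(1 - \nu h)^{s_{ij}} (1 - \mu h)^{N_{ij} - s_{ij}}}{(1 - h)^{d_0 N_{ij}}}, \quad \text{for } 0 \le h < 1. \tag{25}$$

We are free to choose $0 \le h < 1$ in the above bound. We choose $h$ such that the bound is optimized for
the average case $s_{ij} = N_{ij}/2$. For this case, the bound in (25) reduces to

$$\left( \frac{(1 - \nu h)(1 - \mu h)}{(1 - h)^{2 d_0}} \right)^{N_{ij}/2}.$$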
The value of $h$ that minimizes this bound can be checked to be the smaller root of the quadratic given
by ©.
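The claim about the minimizing $h$ can be checked numerically. The quadratic used below is derived here, not copied from the paper (whose equation reference is not legible in this text): setting the derivative of $\ln\left[(1-\nu h)(1-\mu h)/(1-h)^{2 d_0}\right]$ to zero gives $2\mu\nu(d_0 - 1)h^2 + \left[(\mu+\nu)(1 - 2d_0) + 2\mu\nu\right]h + 2d_0 - (\mu+\nu) = 0$. The parameter values are hypothetical.

```python
import math


def log_bound(h, mu, nu, d0):
    """Logarithm of (1 - nu*h)(1 - mu*h) / (1 - h)^(2*d0)."""
    return math.log(1 - nu * h) + math.log(1 - mu * h) - 2 * d0 * math.log(1 - h)


def stationary_roots(mu, nu, d0):
    """Both roots (ascending) of the quadratic from d/dh log_bound = 0."""
    a = 2 * mu * nu * (d0 - 1)
    b = (mu + nu) * (1 - 2 * d0) + 2 * mu * nu
    c = 2 * d0 - (mu + nu)
    disc = math.sqrt(b * b - 4 * a * c)
    return sorted(((-b - disc) / (2 * a), (-b + disc) / (2 * a)))


p = 0.1                 # hypothetical crossover probability
mu = 2 * p * (1 - p)    # disagreement probability when the entries agree
nu = 1 - mu             # disagreement probability when the entries differ
d0 = 0.3                # hypothetical threshold with mu < d0 < 1/2

h_root = stationary_roots(mu, nu, d0)[0]

# Brute-force minimization over a fine grid of [0, 1) for comparison.
grid = [k / 10000 for k in range(0, 9990)]
h_grid = min(grid, key=lambda h: log_bound(h, mu, nu, d0))

print(f"smaller root = {h_root:.4f}, grid minimizer = {h_grid:.4f}")
```

For these values the smaller root lies in $(0, 1)$ and agrees with the grid minimizer, while the larger root falls outside the admissible range of $h$.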

Next, we take expectation in (25) with respect to the erasure sequence $E$. Let $\bar{s}_{ij}$ denote the number
of columns $k$ such that $X_{ik} \ne X_{jk}$. Then we have, from the Chernoff bound, as in (20),

$$\Pr(s_{ij} \le \bar{s}_{ij} r_2 (1-\epsilon)^2) \le \exp(-\bar{s}_{ij} a_2). \tag{26}$$



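Now, from (25), we have

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \left( \frac{1 - \mu h}{(1 - h)^{d_0}} \right)^{N_{ij}} \left( \frac{1 - \nu h}{1 - \mu h} \right)^{s_{ij}}, \quad \text{for } 0 \le h < 1.$$

Now, since $\mu < \nu < 1$, we have

$$\frac{1 - \nu h}{1 - \mu h} < 1, \quad \text{for } h > 0.$$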



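Next note that the function $f(h) = (1 - \mu h)/(1 - h)^{d_0}$ for $h \in [0, 1)$ has derivative

$$f'(h) = \frac{d_0 - \mu + \mu h (1 - d_0)}{(1 - h)^{d_0 + 1}}.$$

Since $\mu < d_0 < 1$, $f'(h) > 0$ and so $f(h) \ge f(0) = 1$. Hence $(1 - \mu h)/(1 - h)^{d_0} \ge 1$. Now if
$s_{ij} > \bar{s}_{ij} r_2 (1-\epsilon)^2$ and $N_{ij} < n r_1 (1-\epsilon)^2$, then

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \left( \frac{1 - \mu h}{(1 - h)^{d_0}} \right)^{n r_1 (1-\epsilon)^2} \left( \frac{1 - \nu h}{1 - \mu h} \right)^{\bar{s}_{ij} r_2 (1-\epsilon)^2}, \quad \text{for } 0 \le h < 1.$$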



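Combining this with (26) and (19), we have

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, X) \le \left( \frac{1 - \mu h}{(1 - h)^{d_0}} \right)^{n r_1 (1-\epsilon)^2} \left( \frac{1 - \nu h}{1 - \mu h} \right)^{\bar{s}_{ij} r_2 (1-\epsilon)^2} + \Pr(s_{ij} \le \bar{s}_{ij} r_2 (1-\epsilon)^2) + \Pr(N_{ij} \ge n r_1 (1-\epsilon)^2)$$
$$\le \left( \frac{1 - \mu h}{(1 - h)^{d_0}} \right)^{n r_1 (1-\epsilon)^2} \left( \frac{1 - \nu h}{1 - \mu h} \right)^{\bar{s}_{ij} r_2 (1-\epsilon)^2} + \exp(-\bar{s}_{ij} a_2) + \exp(-n a_1). \tag{27}$$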



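Since $\bar{s}_{ij} = n_0 \tilde{X}$, where $\tilde{X}$ is Binomial$(t, 1/2)$, we have

$$E[\exp(\lambda \bar{s}_{ij})] = E[\exp(\lambda n_0 \tilde{X})] = \left( \frac{1 + \exp(\lambda n_0)}{2} \right)^{t}.$$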
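The closed form above is the moment generating function of a Binomial$(t, 1/2)$ variable evaluated at $\lambda n_0$. A short check by direct summation over the binomial distribution, with hypothetical values for $\lambda$, $n_0$ and $t$:

```python
import math


def binomial_mgf_exact(lam, n0, t):
    """E[exp(lam * n0 * X)] for X ~ Binomial(t, 1/2), by direct summation."""
    return sum(
        math.comb(t, k) * 0.5 ** t * math.exp(lam * n0 * k) for k in range(t + 1)
    )


def binomial_mgf_closed(lam, n0, t):
    """Closed form ((1 + exp(lam * n0)) / 2)^t."""
    return ((1 + math.exp(lam * n0)) / 2) ** t


lam, n0, t = -0.3, 4, 10  # hypothetical values for illustration
exact = binomial_mgf_exact(lam, n0, t)
closed = binomial_mgf_closed(lam, n0, t)
print(exact, closed)
```

In the proof the exponent is negative (for instance $\lambda = -a_2$), so each factor $(1 + \exp(\lambda n_0))/2$ is strictly below one and the expectation decays geometrically in the number of column clusters $t$.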



Now taking expectation with respect to $X$ in (27), we have

$$P(\hat{I}_{ij} = 1 | I_{ij} = 0) = \Pr(d_{ij} < d_0 | I_{ij} = 0) \le \left( \frac{1 - \mu h}{(1 - h)^{d_0}} \right)^{n r_1 (1-\epsilon)^2} \lambda_1(n_0)^t + \lambda_2(n_0)^t + \exp(-a_1 n) = P_1. \tag{28}$$
It remains to show that

$$P(\hat{I}_{ij} = 1 | I_{ij} = 0) \le P_2.$$

This result follows from Theorem 6 for the general case, which is proved in Section IV-F.



D. Proof of Theorem 5

For simplicity let $Y_1, \ldots, Y_s$ denote the $s$ samples in block $A_i \times B_j$. Let $\mu_a := q(\cdot|a)$, $a \in \mathcal{A}$, be the
transition law of the channel for input $a$ and let $\nu$ denote the empirical probability mass function (PMF)
of $Y_1, \ldots, Y_s$. Let $E_{ijs}$ be the error event when the $(i, j)$th block has $s$ samples. Let $\mathcal{P}_s$
denote the set of types with denominator $s$ [8, pp. 348] and define the set of PMFs:

Upper Bound: Then 

1 



aeA 



' ' a,beA,a^b i/eUa,b 

where in the second step we have used the union bound and in the last step we have used [8, Theorem 
11.1.4, pp. 354]. Let 



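$$C_1 := \lim_{s \to \infty} -\frac{1}{s} \ln \left( \sum_{a, b \in \mathcal{A}, a \ne b} \; \sum_{\nu \in U_{a,b}} \exp(-s D(\nu \| \mu_a)) \right) = \min_{a \ne b} \; \min_{\{\nu : D(\nu \| \mu_b) \le D(\nu \| \mu_a)\}} D(\nu \| \mu_a). \tag{29}$$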



Then for 5 > small, for s > sq{S), we have 



exp(-(Ci-5)s) 



while for s < sq we can bound this probability by 1. Hence we have from (fT2l ). 

exp(-(Ci-<5)s)' 



l{s > so) 1 



E 



> E 



E 



exp(-(Ci-5)s) 



exp(-(Ci-5)s) 



E 



Hs < so) 1 



exp{-iCi-5)s) 
|A| 



E[l{s<so)] 



Pr(s < So). 



(30) 
But for large enough $m_i n_j$, using the Chernoff bound [10, Theorem 1],

$$\Pr(s < s_0) \le \exp\left(-m_i n_j D(s_0/(m_i n_j) \,\|\, 1 - \epsilon)\right). \tag{31}$$

As $m_i n_j \to \infty$, $D(s_0/(m_i n_j) \,\|\, 1 - \epsilon) \to \ln(1/\epsilon)$. Hence given any $\eta > 0$, for large enough $m_i n_j$, we have

$$D(s_0/(m_i n_j) \,\|\, 1 - \epsilon) \ge \ln(1/(\epsilon + \eta)).$$
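The limit used here is easy to verify numerically: for fixed $s_0$, the divergence $D(s_0/N \,\|\, 1-\epsilon)$ increases to $\ln(1/\epsilon)$ as the block size $N = m_i n_j$ grows. The values of `eps`, `eta` and `s0` below are hypothetical.

```python
import math


def kl_bernoulli(a, b):
    """D(a || b) in nats between Bernoulli(a) and Bernoulli(b), with 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


eps, eta, s0 = 0.2, 0.05, 5   # hypothetical erasure probability, slack, and s_0
limit = math.log(1 / eps)
block_sizes = (10 ** 2, 10 ** 4, 10 ** 6)
divergences = [kl_bernoulli(s0 / n, 1 - eps) for n in block_sizes]

for n, d in zip(block_sizes, divergences):
    print(f"N = {n:>7}: D = {d:.5f}, ln(1/eps) = {limit:.5f}")
```

Once the block is large enough, the divergence exceeds $\ln(1/(\epsilon+\eta))$, which is all the proof needs.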

Hence, from (30) and (31),

Pr(EJ) > E (l - _ (, + ,)™.., 

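where we have used the fact that $s$ is Binomial$(m_i n_j, 1 - \epsilon)$ and hence the binomial expansion. Note that
$\epsilon < p_1$, and hence we can choose $\eta$ so that $\epsilon + \eta < p_1$. Hence we have

$$\Pr(E_{ij}^c) \ge 1 - \frac{2 p_1^{m_i n_j}}{|\mathcal{A}|}.$$

Using (13), we then have

$$P_{e|A,B}(X) \le 1 - \prod_{i=1, j=1}^{r, t} \left( 1 - \frac{2 p_1^{m_i n_j}}{|\mathcal{A}|} \right)$$
$$\le 1 - \exp\left( -\frac{4 \ln(2)}{|\mathcal{A}|} \sum_{i=1, j=1}^{r, t} p_1^{m_i n_j} \right), \tag{32}$$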

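where in the last step we have used $1 - x \ge \exp(-2 \ln(2) x)$ for $x \in [0, 1/2]$. Note that for large enough
$m_i n_j$, we have $p_1^{m_i n_j} < 1/2$. But using

$$\sum_{i=1, j=1}^{r, t} p_1^{m_i n_j} \le r t \, p_1^{s_*(X)} \le \frac{mn}{s_*(X)} \, p_1^{s_*(X)},$$

we have

$$P_{e|A,B}(X) \le 1 - \exp\left( -\frac{4 \ln(2)}{|\mathcal{A}|} \, \frac{mn}{s_*(X)} \, p_1^{s_*(X)} \right). \tag{33}$$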



The RHS in (33) is a decreasing function of $s_*(X)$. Hence if

$$s_*(X) \ge \frac{\ln(mn)}{\ln(1/p_1)},$$

then

$$P_{e|A,B}(X) \le 1 - \exp\left( -\frac{4 \ln(2) \ln(1/p_1)}{|\mathcal{A}| \ln(mn)} \right) \to 0.$$
Lower Bound: Next we give a lower bound on $\Pr(E_{ijs})$. If for each $a$ we consider some $b \ne a$, then
we get



|A| 

exp {-sD{i^\\fia)) 



> 



lAI ^ ^ (s + 1)1^1 



^ |A|(„4 + 1)|A| E E exp(-.I)(.||^.)), 

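where again we have used [8, Theorem 11.1.4, pp. 354] in the third step. Since we are free to choose $b$,
we choose it such that

$$b = \arg\min_{b \ne a} \; \min_{\{\nu : D(\nu \| \mu_b) \le D(\nu \| \mu_a)\}} D(\nu \| \mu_a).$$

Then we see that

$$\lim_{s \to \infty} -\frac{1}{s} \ln \left( \sum_{a \in \mathcal{A}} \; \sum_{\nu \in U_{a,b}} \exp(-s D(\nu \| \mu_a)) \right) = C_1.$$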



Hence for $\delta > 0$, for $s \ge s_1(\delta)$,

$$\Pr(E_{ijs}) \ge \frac{\exp(-(C_1 + \delta) s)}{|\mathcal{A}| (s + 1)^{|\mathcal{A}|}}$$

and for smaller $s$ we use the trivial bound that the probability is non-negative. Hence we have from (12),

Pr(E<,) = E iPr(Ey) < E (l - ^^^f^) + Pr(. < »0 (34) 



rriin 
P2 



- 1 ~ TT77 , -,MAi + exp {-minjD{si/minj\\l - e)) (35) 

\f\\{minj + Ijl'^l 

rriinj 



miTii 



I I 

where in (35) we have used the Chernoff bound [10, Theorem 1], in (36) we have used the fact that
$D(s_1/(m_i n_j) \,\|\, 1 - \epsilon) \to \ln(1/\epsilon)$ monotonically and in the last step we have used $m_i n_j > 0$. Further, from
dHll, we have 

niirij 
rriinj ^ P2 



2|A| ' 
and hence 

rriirij 
Using (13), we then have



wx)>i- n (i-w 

i=i,j=i ^ 

>i---p(-^ E ^^rO' (37) 



i=i,j=i 

where to obtain (37) we have used $1 - x \le \exp(-x)$. Now since

r t 



EmiUi ^ . s*(X) ^ s' 
i=l,j=l ^ ' 



we have 



D ^ 1 ( ^*(X) 

The RHS above is a decreasing function of $s^*(X)$, and hence if

$$s^*(X) \le \frac{(1 - \delta) \ln(mn)}{\ln(1/p_2)}, \quad \text{for some } \delta > 0,$$

we have 

n„,«(X)>l-exp(- J^l^) 

and hence $P_{e|A,B}(X) \to 1$ as $mn \to \infty$.



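E. Proof of Lemma 2
We recall

$$d_{ub} = \sum_{p,q} \mu_{pq} / |\mathcal{A}|^2 = \sum_{p,q} \sum_{y \ne z} q(y|p) q(z|q) / |\mathcal{A}|^2.$$

Adding and subtracting the terms corresponding to $y = z$, we have,

$$d_{ub} = \sum_{p,q} \sum_{y,z} q(y|p) q(z|q) / |\mathcal{A}|^2 - \sum_{p,q} \sum_{y} q(y|p) q(y|q) / |\mathcal{A}|^2$$
$$= \left( \sum_{p,y} q(y|p) \right)^2 / |\mathcal{A}|^2 - \sum_{p,q} \sum_{y} q(y|p) q(y|q) / |\mathcal{A}|^2.$$

Now, $\sum_{p,y} q(y|p)$ is the sum of all entries of the transition probability matrix, and hence is equal to $|\mathcal{A}|$.
So we have

$$d_{ub} = 1 - \sum_{p,q} \sum_{y} q(y|p) q(y|q) / |\mathcal{A}|^2 = 1 - \frac{1}{|\mathcal{A}|^2} \sum_{y} \left( \sum_{p} q(y|p) \right)^2. \tag{38}$$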

y \ p / 

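Similarly

$$d_{lb} = \sum_{p} \mu_{pp} / |\mathcal{A}| = \sum_{p} \sum_{y \ne z} q(y|p) q(z|p) / |\mathcal{A}|.$$

Adding and subtracting the terms corresponding to $y = z$, we have,

$$d_{lb} = \sum_{p} \sum_{y,z} q(y|p) q(z|p) / |\mathcal{A}| - \sum_{p} \sum_{y} q^2(y|p) / |\mathcal{A}|$$
$$= \sum_{p} \left( \sum_{y} q(y|p) \right)^2 / |\mathcal{A}| - \sum_{p} \sum_{y} q^2(y|p) / |\mathcal{A}|$$
$$= 1 - \sum_{y} \sum_{p} q^2(y|p) / |\mathcal{A}|. \tag{39}$$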

y p 

In the last step we have used $\sum_{y} q(y|p) = 1$ for the first term. From (38) and (39), we have

$$d_{ub} - d_{lb} = \frac{1}{|\mathcal{A}|^2} \sum_{y} \left[ |\mathcal{A}| \sum_{p} q^2(y|p) - \left( \sum_{p} q(y|p) \right)^2 \right].$$

From the Cauchy-Schwarz inequality,

$$\left( \sum_{p} q(y|p) \right)^2 \le |\mathcal{A}| \sum_{p} q^2(y|p),$$

with equality iff $q(y|p) = q(y|q)$ for all $p, q$. The result then follows. ■
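The inequality $d_{ub} \ge d_{lb}$ and the closed form (38) can be checked on a randomly generated channel. The sketch below assumes a square transition matrix `q[p][y]` (input and output alphabets of the same size, as used in the proof) and takes $\mu_{pr}$ to be the probability that independent outputs for inputs $p$ and $r$ differ; the helper names are hypothetical.

```python
import random


def mu(q, p, r):
    """Probability that independent channel outputs for inputs p and r differ."""
    return 1.0 - sum(q[p][y] * q[r][y] for y in range(len(q[p])))


def d_lb(q):
    """Average of mu_pp over all inputs p."""
    return sum(mu(q, p, p) for p in range(len(q))) / len(q)


def d_ub(q):
    """Average of mu_pr over all input pairs (p, r)."""
    n = len(q)
    return sum(mu(q, p, r) for p in range(n) for r in range(n)) / n ** 2


def d_ub_closed(q):
    """Closed form (38): 1 - (1/|A|^2) * sum_y (sum_p q(y|p))^2."""
    n = len(q)
    return 1.0 - sum(sum(q[p][y] for p in range(n)) ** 2 for y in range(n)) / n ** 2


def random_channel(size, rng):
    """Random row-stochastic matrix; row p is the output law q(.|p)."""
    rows = []
    for _ in range(size):
        weights = [rng.random() for _ in range(size)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


rng = random.Random(1)
channel = random_channel(4, rng)
useless = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]  # all rows identical

print(d_lb(channel), d_ub(channel), d_ub_closed(channel))
print(d_lb(useless), d_ub(useless))
```

The second channel has identical rows, which is exactly the equality case of the Cauchy-Schwarz step, so the two distances coincide there.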

F. Proof of Theorem 6

We begin with a lemma that provides some useful upper bounds.

Lemma 3: Let $Z_1, Z_2, Z_3, \ldots, Z_t$ be i.i.d. with mean $\mu$ such that $0 \le Z_i \le 1$, $\forall\, i \in [1 : t]$. Let
$m_1, m_2, \ldots, m_t$ and $m$ be positive integers such that $\sum_{i=1}^{t} m_i = m$. Let

$$\beta = \frac{1}{m} \sum_{i=1}^{t} m_i Z_i.$$

Then the following hold for sufficiently large n.
1) For $d_0 > \mu$,

$$\Pr(\beta \ge d_0) \le \exp\left( -2 (d_0 - \mu)^2 m^2 / (m_1^2 + m_2^2 + \ldots + m_t^2) \right). \tag{40}$$

2) For $d_0 < \mu$,

$$\Pr(\beta \le d_0) \le \exp\left( -2 (d_0 - \mu)^2 m^2 / (m_1^2 + m_2^2 + \ldots + m_t^2) \right). \tag{41}$$

3) For any positive constant $c$, there exists a positive constant $a$ such that

$$E\left( \exp(-c m (\beta - d_0)^2) \right) \le \exp\left( -a (d_0 - \mu)^2 m^2 / (m_1^2 + m_2^2 + \ldots + m_t^2) \right). \tag{42}$$
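A quick Monte Carlo experiment illustrates (40). The sketch below uses Bernoulli variables for the $Z_i$ (one admissible choice of $[0,1]$-valued i.i.d. variables) with unequal weights $m_i$, and compares the empirical frequency of $\beta \ge d_0$ with the bound. All numerical values are hypothetical.

```python
import math
import random


def hoeffding_bound(weights, d0, mu):
    """Right-hand side of (40): exp(-2 (d0 - mu)^2 m^2 / sum_i m_i^2)."""
    m = sum(weights)
    return math.exp(-2 * (d0 - mu) ** 2 * m ** 2 / sum(w * w for w in weights))


def empirical_upper_tail(weights, d0, mu, trials, rng):
    """Empirical frequency of beta >= d0 with Z_i ~ Bernoulli(mu)."""
    m = sum(weights)
    hits = 0
    for _ in range(trials):
        beta = sum(w for w in weights if rng.random() < mu) / m
        if beta >= d0:
            hits += 1
    return hits / trials


weights = list(range(1, 21))   # hypothetical cluster sizes m_1, ..., m_t
mu, d0 = 0.3, 0.5              # hypothetical mean and threshold, d0 > mu
rng = random.Random(0)

bound = hoeffding_bound(weights, d0, mu)
freq = empirical_upper_tail(weights, d0, mu, trials=20000, rng=rng)
print(f"empirical = {freq:.4f}, bound (40) = {bound:.4f}")
```

The ratio $m^2/(m_1^2 + \ldots + m_t^2)$ plays the role of an effective number of samples: it equals $t$ for equal weights and shrinks as the weights become more unbalanced.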
Proof of Lemma 3: (40) and (41) are direct applications of the Chernoff bound [10, Theorem 2]. (This
particular form is also known as Hoeffding's inequality.) To prove (42), first assume that $d_0 > \mu$. Then

$$E\left( \exp(-c m (\beta - d_0)^2) \right) \le \Pr(|\beta - d_0| \le (d_0 - \mu)/2) + \exp(-c m (d_0 - \mu)^2 / 4)$$
$$\le \Pr(\beta \ge d_0 - (d_0 - \mu)/2) + \exp(-c m (d_0 - \mu)^2 / 4)$$
$$\le \exp\left( -(d_0 - \mu)^2 m^2 / (2 (m_1^2 + m_2^2 + \ldots + m_t^2)) \right) + \exp(-c m (d_0 - \mu)^2 / 4) \quad \text{from (40)}.$$

Now, from the Cauchy-Schwarz inequality, we have

$$m^2 / (m_1^2 + m_2^2 + \ldots + m_t^2) \le t \le m.$$

This gives with $a < \min\{1/2, c/4\}$,

$$E\left( \exp(-c m (\beta - d_0)^2) \right) \le \exp\left( -a (d_0 - \mu)^2 m^2 / (m_1^2 + m_2^2 + \ldots + m_t^2) \right), \tag{43}$$

for sufficiently large n.

To prove (42) in the case $d_0 < \mu$, first note that (40) and (41) hold even when the random variables
take values in $[-1, 0]$. Then apply the above result for the random variables $-Z_i$, $i \in [1 : t]$. ■

As in the proof of Theorem 3, we first analyze the probability of error conditioned on the erasure
sequence and $X$. Let $E$ denote the erasure matrix. That is,

$$E = (\mathbf{1}(Y_{ij} = e))_{m \times n} \in \{0, 1\}^{m \times n}.$$
Rows in Same Cluster: First consider the case when $I_{ij} = 1$, i.e. $i, j$ are in the same cluster. We wish to
evaluate the probability of error $\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X)$. Define $s_{ij}(p, p)$ as the number of columns
$k$ such that $Y_{ik} \ne e$, $Y_{jk} \ne e$ and $X_{ik} = p$. Clearly,

$$\sum_{p} s_{ij}(p, p) = N_{ij}.$$

Note that for such $k$, the indicator $\mathbf{1}(Y_{ik} \ne Y_{jk})$ has mean $\mu_{pp}$. Hence, for $I_{ij} = 1$ and a fixed $E$, the
random variable $N_{ij} d_{ij}$ is given by

$$N_{ij} d_{ij} = \sum_{p \in \mathcal{A}} \; \sum_{k: X_{ik} = p = X_{jk}} \mathbf{1}(Y_{ik} \ne Y_{jk}).$$

The above summation has $s_{ij}(p, p)$ i.i.d. Bernoulli random variables of mean $\mu_{pp}$, for each $p \in \mathcal{A}$, all
the random variables being independent. Hence the characteristic function of $N_{ij} d_{ij}$ (for $I_{ij} = 1$, fixed $E$
and $X$) is given by

$$\prod_{p \in \mathcal{A}} (1 - \mu_{pp} + \mu_{pp} e^{\theta})^{s_{ij}(p, p)}.$$

Using the Chernoff bound, we have

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X) \le \frac{\prod_{p \in \mathcal{A}} (1 - \mu_{pp} + \mu_{pp} e^{\theta})^{s_{ij}(p, p)}}{e^{\theta N_{ij} d_0}}, \quad \text{for any } \theta > 0.$$



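By using the inequality $1 + x \le e^{x}$, we obtain

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X) \le \exp\left( N_{ij} \beta_{ij} (e^{\theta} - 1) - N_{ij} d_0 \theta \right), \quad \theta > 0,$$

where

$$\beta_{ij} = \frac{1}{N_{ij}} \sum_{p \in \mathcal{A}} \mu_{pp} s_{ij}(p, p). \tag{44}$$

Using

$$\theta = \begin{cases} \max\{\ln(d_0 / \beta_{ij}), 0\} & \text{if } \beta_{ij} \ne 0, \\ \infty & \text{if } \beta_{ij} = 0, \end{cases}$$

we obtain

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X) \le \begin{cases} \exp\left( N_{ij} (d_0 - \beta_{ij}) + N_{ij} d_0 \ln(\beta_{ij} / d_0) \right) & \text{if } 0 < \beta_{ij} < d_0, \\ 1 & \text{if } \beta_{ij} \ge d_0, \\ 0 & \text{if } \beta_{ij} = 0. \end{cases} \tag{45}$$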

For tractability, we further simplify this bound. To do so we note that for $-1 < x \le 0$ and $0 < c < 1/2$,
the function $f(x) = \ln(1 + x) - x + c x^2$ is increasing. This can be seen by noting that

$$f'(x) = \frac{x (2c - 1 + 2cx)}{1 + x}.$$

Since $x/(1 + x) \le 0$ and $2c - 1 + 2cx < 0$, we have $f'(x) \ge 0$. Hence $\ln(1 + x) - x + c x^2 \le 0$ in the
interval $-1 < x \le 0$. Now for $0 < \beta_{ij} < d_0$, $-1 < (\beta_{ij} - d_0)/d_0 < 0$, and so

$$\ln\left( \frac{\beta_{ij}}{d_0} \right) \le \frac{\beta_{ij} - d_0}{d_0} - c \left( \frac{\beta_{ij} - d_0}{d_0} \right)^2.$$
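The elementary inequality behind this simplification can be checked on a grid, together with its consequence: the exponent $N_{ij}(d_0 - \beta_{ij}) + N_{ij} d_0 \ln(\beta_{ij}/d_0)$ is at most $-c N_{ij} (\beta_{ij} - d_0)^2 / d_0$ when $0 < \beta_{ij} < d_0$. The constants and sample values below are hypothetical.

```python
import math


def f(x, c):
    """f(x) = ln(1 + x) - x + c x^2, claimed non-positive on (-1, 0] for 0 < c < 1/2."""
    return math.log1p(x) - x + c * x * x


def chernoff_exponent(n_ij, beta, d0):
    """Exponent for 0 < beta < d0: N (d0 - beta) + N d0 ln(beta / d0)."""
    return n_ij * (d0 - beta) + n_ij * d0 * math.log(beta / d0)


def quadratic_exponent(n_ij, beta, d0, c):
    """Simplified exponent: -c N (beta - d0)^2 / d0."""
    return -c * n_ij * (beta - d0) ** 2 / d0


# Largest value of f over a grid of (-1, 0] and several admissible constants c.
worst = max(f(-k / 1000, c) for k in range(0, 1000) for c in (0.1, 0.25, 0.49))

n_ij, beta, d0, c = 100, 0.2, 0.3, 0.4  # hypothetical values with beta < d0
exact = chernoff_exponent(n_ij, beta, d0)
simplified = quadratic_exponent(n_ij, beta, d0, c)
print(worst, exact, simplified)
```

The simplified exponent is weaker than the exact one but is a quadratic in $\beta_{ij} - d_0$, which is the form needed when the expectation over $X$ is taken.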



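Using this in (45), for $d_0 > \beta_{ij}$, we have

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E, X) \le \exp\left( -c N_{ij} \frac{(\beta_{ij} - d_0)^2}{d_0} \right). \tag{46}$$

Taking expectation over $X$, we obtain

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E) \le \Pr(\beta_{ij} \ge d_0 | I_{ij} = 1, E) + E\left[ \exp\left( -c N_{ij} \frac{(\beta_{ij} - d_0)^2}{d_0} \right) \Big| I_{ij} = 1, E \right] \tag{47}$$
$$=: T_1 + T_2.$$

We next bound $T_1$ and $T_2$.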

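For $l \in [1 : t]$, let $n_l(E)$ denote the number of commonly sampled positions for rows $i$ and $j$ in the
$l$-th column cluster, i.e.

$$n_l(E) = \sum_{k \text{ in cluster } l} \mathbf{1}(Y_{ik} \ne e, Y_{jk} \ne e).$$

Note that

$$\sum_{l=1}^{t} n_l(E) = N_{ij}$$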
1=1 
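and

$$s_{ij}(p, p) = \sum_{l=1}^{t} n_l(E) \mathbf{1}(X_{i\{l\}} = p),$$

where $X_{i\{l\}}$ is the rating vector of user $i$ in the $l$-th column cluster. From (44) and the above equation,

$$N_{ij} \beta_{ij} = \sum_{p \in \mathcal{A}} \mu_{pp} \sum_{l=1}^{t} n_l(E) \mathbf{1}(X_{i\{l\}} = p) = \sum_{l=1}^{t} n_l(E) \sum_{p \in \mathcal{A}} \mu_{pp} \mathbf{1}(X_{i\{l\}} = p), \tag{48}$$

where the random variable

$$Z_l = \sum_{p \in \mathcal{A}} \mu_{pp} \mathbf{1}(X_{i\{l\}} = p)$$

takes the value $\mu_{pp}$ with probability $1/|\mathcal{A}|$, for each $p \in \mathcal{A}$. The mean of $Z_l$ is $d_{lb} = \sum_{p} \mu_{pp} / |\mathcal{A}|$. Further, the
$Z_l$'s are i.i.d. From (48), Lemma 3 can be applied with $\beta = \beta_{ij}$, the variables $Z_l$, and $m_l = n_l(E)$. Using (40) of Lemma
3, we have

$$T_1 \le \exp\left( -a_1 (d_0 - d_{lb})^2 N_{ij}^2 / (n_1^2(E) + n_2^2(E) + \ldots + n_t^2(E)) \right)$$

for some positive constant $a_1$. Similarly using (42) of Lemma 3, we have

$$T_2 \le \exp\left( -a_2 (d_0 - d_{lb})^2 N_{ij}^2 / (n_1^2(E) + n_2^2(E) + \ldots + n_t^2(E)) \right)$$

for some positive constant $a_2$. From (47), we then have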



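$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E) \le 2 \exp\left( -a (d_0 - d_{lb})^2 N_{ij}^2 / (n_1^2(E) + n_2^2(E) + \ldots + n_t^2(E)) \right)$$

for some positive constant $a$ and for sufficiently large n. Using $n_l(E) \le n_l$, we can loosen the bound to

$$\Pr(d_{ij} > d_0 | I_{ij} = 1, E) \le 2 \exp\left( -a (d_0 - d_{lb})^2 N_{ij}^2 / (n_1^2 + n_2^2 + \ldots + n_t^2) \right).$$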
Taking expectation over $E$, for $a = r_2 (1 - \epsilon)^2$ and suitable positive constants $c_1$, $c_2$ and $c$, we have

$$\Pr(d_{ij} > d_0 | I_{ij} = 1) \le 2 \exp\left( -a (d_0 - d_{lb})^2 n^2 / (n_1^2 + n_2^2 + \ldots + n_t^2) \right) + \Pr(N_{ij} < na)$$
$$\le 2 \exp\left( -c_1 n^2 / (n_1^2 + n_2^2 + \ldots + n_t^2) \right) + \exp(-c_2 n) \tag{49}$$
$$\le c' \exp\left( -c n^2 / (n_1^2 + n_2^2 + \ldots + n_t^2) \right), \tag{50}$$

where in (49) we have used (20). (50) is obtained by a similar argument as used to obtain (43), using the
Cauchy-Schwarz inequality.

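Rows in Different Clusters: Next consider the case $I_{ij} = 0$, i.e. rows $i$ and $j$ are in different clusters. The
bounding technique is similar to the case when $I_{ij} = 1$. We wish to evaluate $\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X)$.
Let $s_{ij}(p, q)$ be the number of columns $k$ such that $Y_{ik} \ne e$, $Y_{jk} \ne e$, $X_{ik} = p$ and $X_{jk} = q$. Then for
$I_{ij} = 0$ and fixed $E$, $X$, the random variable $N_{ij} d_{ij}$ is given by

$$N_{ij} d_{ij} = \sum_{p, q \in \mathcal{A}} \; \sum_{k: X_{ik} = p, X_{jk} = q} \mathbf{1}(Y_{ik} \ne Y_{jk}).$$

The above summation has $s_{ij}(p, q)$ i.i.d. Bernoulli random variables of mean $\mu_{pq}$, for each $(p, q) \in \mathcal{A}^2$,
all the random variables being independent. Using the Chernoff bound, we may then write

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \frac{\prod_{p, q \in \mathcal{A}} (1 - \mu_{pq} + \mu_{pq} e^{\theta})^{s_{ij}(p, q)}}{e^{\theta N_{ij} d_0}}, \quad \text{for any } \theta < 0.$$

By using the inequality $1 + x \le e^{x}$, we obtain

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \exp\left( N_{ij} \beta_{ij} (e^{\theta} - 1) - N_{ij} d_0 \theta \right), \quad \text{for any } \theta < 0,$$

where

$$\beta_{ij} = \frac{1}{N_{ij}} \sum_{p, q \in \mathcal{A}} \mu_{pq} s_{ij}(p, q). \tag{51}$$

Using $\theta = \min\{\ln(d_0 / \beta_{ij}), 0\}$, we obtain

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \begin{cases} \exp\left( N_{ij} (d_0 - \beta_{ij}) + N_{ij} d_0 \ln(\beta_{ij} / d_0) \right) & \text{if } \beta_{ij} > d_0, \\ 1 & \text{if } \beta_{ij} \le d_0. \end{cases} \tag{52}$$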



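But $s_{ij}(p, q) \le N_{ij}$, and so

$$\beta_{ij} \le \sum_{p, q \in \mathcal{A}} \mu_{pq},$$

where $s$ is defined as

$$s = \frac{\sum_{p, q \in \mathcal{A}} \mu_{pq} - d_0}{d_0}.$$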
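So for $\beta_{ij} > d_0$, we have $0 < (\beta_{ij} - d_0)/d_0 \le s$. But for any $0 < c < 1/(2(1 + s))$, the function
$f(x) = \ln(1 + x) - x + c x^2$ is a decreasing function on $[0, s]$. So we have the following:

$$\ln\left( \frac{\beta_{ij}}{d_0} \right) \le \frac{\beta_{ij} - d_0}{d_0} - c \left( \frac{\beta_{ij} - d_0}{d_0} \right)^2.$$

Using this in (52), for $\beta_{ij} > d_0$, we have

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E, X) \le \exp\left( -c N_{ij} \frac{(\beta_{ij} - d_0)^2}{d_0} \right). \tag{53}$$

Taking expectation over $X$, we obtain

$$\Pr(d_{ij} < d_0 | I_{ij} = 0, E) \le \Pr(\beta_{ij} \le d_0 | I_{ij} = 0, E) + E\left[ \exp\left( -c N_{ij} \frac{(\beta_{ij} - d_0)^2}{d_0} \right) \Big| I_{ij} = 0, E \right]. \tag{54}$$

Then we follow the same line of arguments as in the case when $I_{ij} = 1$. Note that now

$$N_{ij} \beta_{ij} = \sum_{p, q \in \mathcal{A}} \mu_{pq} \sum_{l=1}^{t} n_l(E) \mathbf{1}(X_{i\{l\}} = p) \mathbf{1}(X_{j\{l\}} = q) = \sum_{l=1}^{t} n_l(E) \sum_{p, q \in \mathcal{A}} \mu_{pq} \mathbf{1}(X_{i\{l\}} = p) \mathbf{1}(X_{j\{l\}} = q), \tag{55}$$

where the random variable

$$Z_l = \sum_{p, q \in \mathcal{A}} \mu_{pq} \mathbf{1}(X_{i\{l\}} = p) \mathbf{1}(X_{j\{l\}} = q)$$

takes the value $\mu_{pq}$ with probability $1/|\mathcal{A}|^2$, for each $(p, q) \in \mathcal{A}^2$. The mean of $Z_l$ is $d_{ub} = \sum_{p,q} \mu_{pq} / |\mathcal{A}|^2$. Further, the $Z_l$'s are
i.i.d. Applying Lemma 3 and (20) as in the case of $I_{ij} = 1$, we again have

$$\Pr(d_{ij} < d_0 | I_{ij} = 0) \le c' \exp\left( -c n^2 / (n_1^2 + n_2^2 + \ldots + n_t^2) \right).$$

Since there are at most $m(m - 1)/2$ pairs of rows, the result follows by the union bound.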